Jared - pretty much everything, Maria wrote some priors that didn't end up being used.
We often want to ascertain how tightly two proteins are bound by measuring their dissociation constant, $K_d$. This is usually done by performing a titration experiment and then a regression. For example, imagine two proteins, a and b, that may bind to each other in the reaction
\begin{align} \text{ab} \rightleftharpoons \text{a} + \text{b} \end{align}
with dissociation constant $K_d$. At equilibrium
\begin{align} K_d = \frac{c_a\,c_b}{c_{ab}}, \end{align}
where $c_i$ is the concentration of species $i$. If we add known amounts of a and b to a solution such that the total concentration of a is $c_a^0$ and the total concentration of b is $c_b^0$, we can compute the equilibrium concentrations of all species. Specifically, in addition to the equation above, we have conservation of mass equations,
\begin{align} c_a^0 &= c_a + c_{ab}\\[1em] c_b^0 &= c_b + c_{ab}, \end{align}
fully specifying the problem. We can solve the three equations for $c_{ab}$ in terms of the known quantities $c_a^0$ and $c_b^0$, along with the parameter we are trying to measure, $K_d$. We get
\begin{align} c_{ab} = \frac{2c_a^0\,c_b^0}{K_d+c_a^0+c_b^0 + \sqrt{\left(K_d+c_a^0+c_b^0\right)^2 - 4c_a^0\,c_b^0}}. \end{align}
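As a quick sanity check (with made-up numbers, not the experimental ones), we can verify numerically that this root satisfies both mass conservation and the equilibrium relation:

```python
import numpy as np

def cab(Kd, ca0, cb0):
    """Equilibrium concentration of the ab complex (inputs in consistent units)."""
    b = Kd + ca0 + cb0
    return 2 * ca0 * cb0 / (b + np.sqrt(b**2 - 4 * ca0 * cb0))

# Hypothetical values: Kd = 10 nM, ca0 = 50 nM, cb0 = 25 nM
c_ab = cab(10.0, 50.0, 25.0)
c_a = 50.0 - c_ab  # conservation of mass for a
c_b = 25.0 - c_ab  # conservation of mass for b

# The equilibrium relation should recover Kd
print(np.isclose(c_a * c_b / c_ab, 10.0))  # True
```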
The technique, then, is to hold $c_a^0$ fixed in the experiment and measure $c_{ab}$ for various $c_b^0$. We can then perform a regression to get $K_d$.
In order to do this, though, we need some readout of $c_{ab}$. For this problem, we will use FRET (fluorescence resonance energy transfer) to monitor how much of a is bound to b. Specifically, we consider a to carry a fluorophore and b its acceptor. When the two are unbound, we get a fluorescence signal per molecule of $f_0$. When they are bound, the acceptor absorbs the light emitted by the fluorophore, so we get less fluorescence per molecule, which we will call $f_q$ (for "quenched"). Let $f$ be the total per-fluorophore fluorescence signal. Then, the measured fluorescence signal, $F$, is
\begin{align} F = c_a^0\,V f = \left(c_a \,f_0 + c_{ab}\, f_q\right)V, \end{align}
where $V$ is the reaction volume. We can absorb $V$ into the other parameters such that $\hat{f}_0 = f_0 V$ and $\hat{f}_q = f_q V$, giving
\begin{align} F = \hat{f}_0(c_a^0 - c_{ab}) + \hat{f}_q\, c_{ab} = \hat{f}_0\,c_a^0 - \frac{2(\hat{f}_0 - \hat{f}_q)c_a^0\,c_b^0}{K_d+c_a^0+c_b^0 + \sqrt{\left(K_d+c_a^0+c_b^0\right)^2 - 4c_a^0\,c_b^0}}. \end{align}
Compute parameter estimates for $K_d$ with and without an outlier detection scheme for this data set. How do the results differ depending on whether or not you were trying to detect outliers?
Note: These are real data, but they are from an unpublished experiment here on campus. I therefore have not exposed the identities of the proteins a and b.
import glob
import os
import sys
import numpy as np
import pandas as pd
import scipy.signal
# Import Altair for high level plotting
import altair as alt
import altair_catplot as altcat
# Prevent bulky Altair plots by writing data to JSON files
alt.data_transformers.enable('json')
import bebi103
import bokeh
import bokeh.io
from bokeh.palettes import all_palettes
import itertools
from bokeh.models import Legend, LegendItem
bokeh.io.output_notebook()
First let's read in the data:
df = pd.read_csv("../data/fret_binding_curve.csv", comment = '#')
df
And take a look:
plot1 = alt.Chart(df).mark_point(color='red').encode(
x = alt.X("b conc (nM)"),
y = alt.Y("fluorescence")
).interactive()
plot1
Hmmm I wonder if there are any outliers haha.
Given our known quantities and our goal, the relevant equation is: \begin{align} F = \hat{f}_0\,c_a^0 - \frac{2(\hat{f}_0 - \hat{f}_q)c_a^0\,c_b^0}{K_d+c_a^0+c_b^0 + \sqrt{\left(K_d+c_a^0+c_b^0\right)^2 - 4c_a^0\,c_b^0}}. \end{align}
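Written as code, the theoretical curve we will fit is just the following (a sketch; the parameter values in the example call are illustrative, not fit results):

```python
import numpy as np

def theor_fluorescence(cb0, Kd, f0_hat, fq_hat, ca0=50.0):
    """Theoretical fluorescence F as a function of total b concentration (nM)."""
    b = Kd + ca0 + cb0
    c_ab = 2 * ca0 * cb0 / (b + np.sqrt(b**2 - 4 * ca0 * cb0))
    return f0_hat * (ca0 - c_ab) + fq_hat * c_ab

# Illustrative parameters; F decreases from f0_hat*ca0 toward fq_hat*ca0
# as more b is titrated in and more fluorophore is quenched
cb0 = np.array([0.0, 10.0, 100.0, 1000.0])
F = theor_fluorescence(cb0, Kd=10.0, f0_hat=10000.0, fq_hat=5000.0)
```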
The concentration of a is fixed in this experiment and that of b is our independent variable, so we need parameter estimates of $K_d$, $\hat{f}_0$, and $\hat{f}_q$. Additionally, we include a Gaussian error term $\sigma$ to reflect measurement error. Our model and priors are as follows:
$$\text{fl} \sim \text{Norm}(F(c_a^0, c_b^0), \sigma).$$
Thus, we expect the observed fluorescence to be Gaussian distributed around the value predicted by the equation above.
$K_d = e^{\Delta G/RT}$. This defines $K_d$ in terms of the binding free energy $\Delta G$.
$\Delta G/RT \sim \text{Norm}(0, 1)$. We are not certain whether the $\Delta G$ values will be positive or negative, so we take them (nondimensionalized by the gas constant and temperature) to be normally distributed around 0.
$\hat{f}_0 \sim \text{Norm}(10000, 1000)$. Fluorescence is a unitless relative value, but in this experiment it appears to lie between 100000 and 500000. Since $F = \hat{f}_0\,c_a^0$ when nothing is bound, we expect the highest fluorescence value to be about $50\,\hat{f}_0$, so we choose a distribution with wide variation at the top of this scale.
$\hat{f}_q \sim \text{Norm}(5000, 500)$. We expect quenched molecules to fluoresce much less than unquenched ones, so we choose a distribution centered at half that of $\hat{f}_0$.
$\sigma \sim \text{HalfNorm}(0, 5000)$. This is the standard deviation of the Gaussian distribution of the measurements. Additionally, we know that $c_a^0 = 50$ nM, and $c_b^0$ is given by the values in the dataframe:
data = dict(N=len(df),
cb=df['b conc (nM)'].values.astype(float))
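Before compiling the Stan model, we can sketch the same generative process in NumPy as a rough check that these priors produce sensible curves (this mirrors, but is not, the actual .stan code):

```python
import numpy as np

rng = np.random.default_rng(3252)
ca0 = 50.0                    # nM, fixed in the experiment
cb0 = np.logspace(-1, 3, 10)  # hypothetical b concentrations (nM)

# One draw from the priors stated above
dG = rng.normal(0.0, 1.0)                # Delta G / RT
Kd = np.exp(dG)                          # Kd = exp(Delta G / RT)
f0 = rng.normal(10000.0, 1000.0)         # f0_hat
fq = rng.normal(5000.0, 500.0)           # fq_hat
sigma = np.abs(rng.normal(0.0, 5000.0))  # half-normal via folding

# Theoretical curve and a noisy generated data set around it
b = Kd + ca0 + cb0
c_ab = 2 * ca0 * cb0 / (b + np.sqrt(b**2 - 4 * ca0 * cb0))
F = f0 * (ca0 - c_ab) + fq * c_ab
fl = rng.normal(F, sigma)
```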
We code this up into a prior predictive check:
sm = bebi103.stan.StanModel(file='./9.2_prior.stan')
We do our sampling and load the samples into a dataframe:
samples = sm.sampling(data=data, algorithm="Fixed_param",
warmup=0,
chains=1,
iter=1000)
df_gen = bebi103.stan.to_dataframe(samples, diagnostics=False)
We use dataframe methods to look at the fluorescence values for each sample:
# Define the columns we want to take out
cols = ["fl[1]", "fl[2]", "fl[3]", "fl[4]", "fl[5]",
"fl[6]", "fl[7]", "fl[8]", "fl[9]", "fl[10]"]
# Take out the columns
df_gen_fl = df_gen[cols]
# Reindex
df_gen_fl = df_gen_fl.stack(level=0)
df_gen_fl = df_gen_fl.sort_index(level=1)
df_gen_fl = df_gen_fl.reset_index()
# Renaming the columns with useful names
df_gen_fl = df_gen_fl.rename(columns={'level_0': 'model',
'level_1': 'cb',
0:'fluorescence signal'})
# Replace the fl[i] column labels with the corresponding cb values
for i in range(len(cols)):
df_gen_fl = df_gen_fl.replace({cols[i]:data['cb'][i]})
# Take a look
alt.Chart(df_gen_fl).mark_line().encode(
x = alt.X('cb'),
y = alt.Y('fluorescence signal'),
color = alt.Color('model')).interactive()
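Incidentally, the stack/reset_index wrangling above can be written more compactly with melt; here is a sketch on a toy frame (the real column names would be fl[1] through fl[10]):

```python
import pandas as pd

# Toy stand-in for df_gen: two samples, three fl columns
df_toy = pd.DataFrame({'fl[1]': [1.0, 2.0],
                       'fl[2]': [3.0, 4.0],
                       'fl[3]': [5.0, 6.0]})
cb_vals = {'fl[1]': 0.1, 'fl[2]': 1.0, 'fl[3]': 10.0}  # hypothetical cb values

# Melt to long format, then map column labels to cb concentrations
tidy = (df_toy.reset_index()
              .melt(id_vars='index',
                    var_name='cb',
                    value_name='fluorescence signal')
              .rename(columns={'index': 'model'})
              .replace({'cb': cb_vals}))
```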
This looks good! The fluorescence values appear as we would expect: they are of the correct order of magnitude and depend on cb. Let's code this up into a Stan model:
sm2 = bebi103.stan.StanModel(file='./9.2_model.stan')
Load up our data and sample:
data = dict(N=len(df),
cb=df['b conc (nM)'].values.astype(float),
fl=df['fluorescence'].values.astype(float))
samples2 = sm2.sampling(data=data,
control=dict(adapt_delta = .9999999,
max_treedepth = 12),
warmup=2000,
iter=10000,
thin=5)
bebi103.stan.check_all_diagnostics(samples2)
That looks like it worked fine, so let's load it up into a dataframe and take a look:
dfmcmc = bebi103.stan.to_dataframe(samples2, diagnostics=False, inc_warmup=False)
dfmcmc.head()
Let's check the corner plot to make sure everything looks the way we would expect:
bokeh.io.show(bebi103.viz.corner(samples2,
pars=['Delta_G', 'f0', 'fq','sigma']))
Let's look at the posterior predictive check to verify our model is doing what we think:
p = bebi103.viz.predictive_ecdf(samples2,
name='fl_ppc',
data=df["fluorescence"],
x=np.linspace(200000, 500000),
percentiles=[99, 80, 60, 40])
p.plot_width=800
bokeh.io.show(p)
This looks good, but because we lose the cb information in the ECDF format, I'd also like to plot the fluorescence values predicted in the posterior predictive checks against the cb values:
def check_ppc(df):
# Define the columns we want to take out
cols = ["fl_ppc[1]", "fl_ppc[2]", "fl_ppc[3]", "fl_ppc[4]", "fl_ppc[5]",
"fl_ppc[6]", "fl_ppc[7]", "fl_ppc[8]", "fl_ppc[9]", "fl_ppc[10]"]
# Take out the columns
df_fl = df[cols]
# Reindex
df_fl = df_fl.stack(level=0)
df_fl = df_fl.sort_index(level=1)
df_fl = df_fl.reset_index()
# Renaming the columns with useful names
df_fl = df_fl.rename(columns={'level_0': 'model',
'level_1': 'cb',
0:'fluorescence signal'})
# Replace the fl_ppc[i] column labels with the corresponding cb values
for i in range(len(cols)):
df_fl = df_fl.replace({cols[i]:data['cb'][i]})
# Take a look
plot2 = alt.Chart(df_fl).mark_point().encode(
x = alt.X('cb', title = "b conc (nM)"),
y = alt.Y('fluorescence signal')).interactive()
return plot2 + plot1
check_ppc(dfmcmc)
This looks pretty good. We see wide variation around each of the points, likely because the model is trying to accommodate the outliers. Let's now estimate the Kd from this model:
# First let's make a helper plotting function
def make_ecdf(dataframe, data):
c = altcat.catplot(data=dataframe,
mark='line',
encoding=dict(x=alt.X(data,
scale=alt.Scale(
clamp=True))),
transform='ecdf'
).properties(height=300,
width=300).interactive()
return c
# And now apply it
make_ecdf(dfmcmc, "Kd:Q")
Judging by this ECDF, the mean would not be a good summary statistic, so let's report the median instead.
Kd = dfmcmc["Kd"].median()
Kd
Now let's attempt to deal with the outliers. We will start with the Student-t model. The only difference between this model and our original is that the data are distributed according to a Student-t distribution instead of a Gaussian, so $$\text{fl} \sim \text{Student-t}(\nu, F(c_a^0, c_b^0), \sigma),$$
where $\nu \sim \text{HalfNorm}(1, 100)$, since $\nu$ is bounded below by 1 and controls how close to Gaussian the distribution is.
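The reason the Student-t likelihood tolerates outliers is its heavy tails: a point far from the theoretical curve is much less improbable under a Student-t than under a Gaussian, so it pulls the fit less. A quick illustration with scipy.stats (the choice of ν = 2 is purely for illustration):

```python
import scipy.stats

# Two-sided probability of landing more than 5 scale parameters from the center
p_norm = 2 * scipy.stats.norm.sf(5.0)
p_t = 2 * scipy.stats.t.sf(5.0, df=2.0)

# The heavy-tailed Student-t makes such an extreme point tens of
# thousands of times more likely than the Gaussian does
print(p_t / p_norm)
```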
Let's load up this stan model:
sm3 = bebi103.stan.StanModel(file='./9.2_student_model.stan')
And sample:
samples3 = sm3.sampling(data=data,
control=dict(adapt_delta = .9999999,
max_treedepth = 12),
warmup=2000,
iter=10000,
thin=5)
bebi103.stan.check_all_diagnostics(samples3)
Diagnostics look good, let's load this into a dataframe and take a look:
df_student = bebi103.stan.to_dataframe(samples3, diagnostics=False, inc_warmup=False)
df_student.head()
And take a look at the parameters:
bokeh.io.show(bebi103.viz.corner(samples3,
pars=['Delta_G', 'f0', 'fq','sigma', 'nu']))
This looks as expected, so let's take a look at our posterior predictive checks:
p = bebi103.viz.predictive_ecdf(samples3,
name='fl_ppc',
data=df["fluorescence"],
x=np.linspace(200000, 500000),
percentiles=[99, 80, 60, 40])
p.plot_width=800
bokeh.io.show(p)
This seems to fit our values pretty well, but let's also look at the fluorescence values versus the b concentration:
check_ppc(df_student)
This model is generating a lot of outliers to compensate for those in our data set. Let's compare the Kd values for this model and our original:
make_ecdf(df_student, "Kd:Q") | make_ecdf(dfmcmc, "Kd:Q")
These look rather similar. Let's compute the median:
Kd2 = df_student["Kd"].median()
Kd2
Finally, let's use a good/bad model for outlier detection and see what effect that has. We use a mixture model in which each data point has a probability $w_i$ of being good, with error bar $\sigma$, and is otherwise bad, with a larger error bar $\sigma_\mathrm{bad} > \sigma$. Both sigmas will be drawn from the same distribution we used in previous models, $\text{HalfNorm}(0, 5000)$, and
$$w_i \sim \text{Beta}(2, 5).$$
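In Stan, marginalizing over the good/bad assignment is done with log_sum_exp; the per-datum log likelihood can be sketched in Python like this (the names here are illustrative, not taken from the actual .stan file):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def good_bad_logpdf(fl, mu, w, sigma, sigma_bad):
    """Log likelihood of one datum under the good/bad mixture."""
    return logsumexp([np.log(w) + norm.logpdf(fl, mu, sigma),
                      np.log(1.0 - w) + norm.logpdf(fl, mu, sigma_bad)])

# Illustrative numbers: a point 10 "good" sigmas from the curve is
# nearly impossible under the good component alone, but the wide bad
# component keeps its log likelihood moderate
ll_mix = good_bad_logpdf(fl=4.0e5, mu=3.5e5, w=0.7, sigma=5000.0, sigma_bad=50000.0)
ll_good_only = norm.logpdf(4.0e5, 3.5e5, 5000.0)
```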
Let's load up the model:
sm4 = bebi103.stan.StanModel(file='./9.2_good_model.stan')
And sample:
samples4 = sm4.sampling(data=data,
control=dict(adapt_delta = .9999999,
max_treedepth = 12),
warmup=2000,
iter=10000,
thin=5)
bebi103.stan.check_all_diagnostics(samples4)
Diagnostics look good, let's convert this to a dataframe and take a look:
df_good = bebi103.stan.to_dataframe(samples4, diagnostics=False, inc_warmup=False)
df_good.head()
Take a look at parameters:
bokeh.io.show(bebi103.viz.corner(samples4,
pars=['Delta_G', 'f0', 'fq','sigma[1]','sigma[2]']))
There's some odd clustering between sigma[1] and the f0 and fq values, but no divergences. Let's do our posterior predictive checks:
p = bebi103.viz.predictive_ecdf(samples4,
name='fl_ppc',
data=df["fluorescence"],
x=np.linspace(200000, 500000),
percentiles=[99, 80, 60, 40])
p.plot_width=800
bokeh.io.show(p)
This model clearly fits our data well.
check_ppc(df_good)
This model is more reminiscent of the original model than the Student-t one, as it generates far fewer large outliers, but it still incorporates all of our data.
make_ecdf(df_good, "Kd:Q") | make_ecdf(dfmcmc, "Kd:Q")
These also look very similar. Let's report the value of the median:
Kd3 = df_good["Kd"].median()
Kd3
Finally, let's collect the Kd estimates from each of the models for comparison:
# Compile the median and 95% HPD interval of Kd for each model
rows = []
for method, s in [('mcmc', samples2),
                  ('student_t', samples3),
                  ('good_bad', samples4)]:
    kd_samples = s.extract('Kd')['Kd']
    hpd_low, hpd_high = bebi103.stan.hpd(kd_samples, mass_frac=0.95)
    rows.append({'method': method,
                 'hpd_low': hpd_low,
                 'median': np.median(kd_samples),
                 'hpd_high': hpd_high})
df_res = pd.DataFrame(rows, columns=['method', 'hpd_low', 'median', 'hpd_high'])
# Take a look
df_res
The good_bad model has a tighter distribution of Kd values than either of the other two models, and the median value for Kd is lower for the good_bad model than the other two. The student_t model and the original model yield very similar values for Kd.
Finally, let's print all of the Stan code used here:
print("-------------MODEL CHECK 1-------------")
print(sm.model_code)
print("-------------MODEL CHECK 2-------------")
print(sm2.model_code)
print("-------------MODEL CHECK 3-------------")
print(sm3.model_code)
print("-------------MODEL CHECK 4-------------")
print(sm4.model_code)